library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instructions (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. Markdown files are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).
Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.
garden_harvest %>%
mutate(obs = row_number(),
weekday = wday(date, label = TRUE)) %>%
group_by(vegetable, weekday) %>%
summarize(tot_wt_dow = (sum(weight)* 0.00220462)) %>%
pivot_wider(names_from = weekday,
values_from = tot_wt_dow)
garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?
garden_harvest %>%
group_by(variety, vegetable) %>%
summarize(total_variety_wt_lbs = (sum(weight)* 0.00220462))
garden_harvest %>%
left_join(garden_planting,
by = c("vegetable", "variety")) %>%
group_by(variety, vegetable, plot) %>%
summarize(total_variety_wt_lbs = (sum(weight)* 0.00220462))
Discussion: The issue is that some varieties were planted in more than one plot, and others have no recorded plot. After the join, every harvest record for a variety planted in multiple plots is repeated once per plot, so that variety's full harvest total is reported for each of its plots. Someone reading the table could conclude that the variety's total weight is several times larger than it actually is. Fixing this would have required the data entry to record the plot with each harvest whenever a variety was planted in more than one plot. The grouping could then distinguish between plots, and we could calculate totals by variety and plot.
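The duplication described above can be reproduced on a small scale. The sketch below uses made-up tables (harvest_toy and planting_toy are hypothetical stand-ins, not the real gardenR data) in which one variety is planted in two plots.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, NOT the real garden_harvest / garden_planting tables
harvest_toy <- tribble(
  ~vegetable, ~variety, ~weight,
  "tomatoes", "grape",  100,
  "tomatoes", "grape",  200,
  "beans",    "bush",    50
)
planting_toy <- tribble(
  ~vegetable, ~variety, ~plot,
  "tomatoes", "grape",  "A",
  "tomatoes", "grape",  "B",
  "beans",    "bush",   "C"
)

# Each grape harvest row matches two planting rows, so it is repeated.
# Recent dplyr versions may warn about a many-to-many relationship here.
joined_toy <- harvest_toy %>%
  left_join(planting_toy, by = c("vegetable", "variety"))

# The full grape total is now reported once for plot A and once for plot B
totals_by_plot <- joined_toy %>%
  group_by(vegetable, variety, plot) %>%
  summarize(total_wt = sum(weight), .groups = "drop")

nrow(harvest_toy)             # rows before the join
nrow(joined_toy)              # rows after the join
sum(harvest_toy$weight)       # true total weight
sum(totals_by_plot$total_wt)  # inflated total after joining and summing
```

Comparing the two row counts and the two sums shows how the join alone inflates the apparent harvest.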
garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.
Discussion: To understand how much money you saved by gardening, it would help to know how much the total weight of each vegetable and variety would cost at a grocery store. Using a grocery store’s website, you could calculate the total cost by multiplying their price per lb by the number of lbs you harvested in your garden. You could create a dataset with the grocery store prices and join it to the harvest totals using a left_join() by vegetable/variety (assuming that you could find all varieties and vegetables at the grocery store). You could then build a more complete dataset by left joining the result with the garden_spending dataset. Using all this information, you could create one variable that calculates the cost for each vegetable/variety at a grocery store, and another that estimates the savings by subtracting your initial investment in the seeds/plants from the total grocery store cost.
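The joins described above might look roughly like the following sketch. All three tables and their column names (total_lbs, price_per_lb, seed_cost) are hypothetical and the numbers are invented. The real garden_spending table may be organized differently.

```r
library(dplyr)
library(tibble)

# Hypothetical harvest totals in pounds (invented numbers)
harvest_lbs <- tribble(
  ~vegetable, ~variety, ~total_lbs,
  "tomatoes", "grape",  10,
  "beans",    "bush",    4
)

# Hypothetical grocery store prices, e.g. copied from a store website
grocery_prices <- tribble(
  ~vegetable, ~variety, ~price_per_lb,
  "tomatoes", "grape",  3,
  "beans",    "bush",   2
)

# Hypothetical seed/plant costs standing in for garden_spending
spending_toy <- tribble(
  ~vegetable, ~variety, ~seed_cost,
  "tomatoes", "grape",  5,
  "beans",    "bush",   3
)

savings <- harvest_lbs %>%
  left_join(grocery_prices, by = c("vegetable", "variety")) %>%
  left_join(spending_toy, by = c("vegetable", "variety")) %>%
  mutate(store_cost  = total_lbs * price_per_lb,  # what the harvest would cost to buy
         est_savings = store_cost - seed_cost)    # minus what was spent to grow it

savings
```

Using left_join() keeps every harvested variety even when no store price is found, in which case the savings for that row would be missing rather than silently dropped.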
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
group_by(variety) %>%
summarize(first_harvest = min(date),
tot_wt = (sum(weight)*0.00220462)) %>%
arrange(first_harvest)
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
group_by(variety) %>%
summarize(first_harvest = min(date),
tot_wt = (sum(weight)*0.00220462)) %>%
ggplot(aes(y = fct_reorder(variety, first_harvest, min), x = tot_wt)) +
geom_col() +
labs(title = "Total Tomato Harvest in lbs, Sorted by Date of First Harvest",
x = "",
y = "")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().
garden_harvest %>%
# one row per vegetable/variety combination
distinct(vegetable, variety) %>%
mutate(variety_lower = str_to_lower(variety),
variety_length = str_length(variety)) %>%
# sort by vegetable first, then by name length within each vegetable
arrange(vegetable, variety_length)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().
garden_harvest %>%
distinct(vegetable, variety) %>%
mutate(has_er_ar = str_detect(variety, "er|ar")) %>%
# keep only the varieties whose name matches
filter(has_er_ar)
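As a quick illustration of how the "er|ar" pattern behaves, here is a small sketch on a made-up vector of names (the vector is hypothetical and only chosen to show matches and non-matches).

```r
library(stringr)

# Hypothetical variety names for illustration only
varieties <- c("Amish Paste", "grape", "Better Boy", "Cherokee Purple")

# TRUE where the name contains "er" or "ar" anywhere
matches <- str_detect(varieties, "er|ar")

# Subset to the matching names
varieties[matches]
```

Note that "grape" does not match: it contains "ra", not "ar", and the pattern is order-sensitive.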
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
Trips contains records of individual rentals
Stations gives the locations of the bike rental stations
Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations <- read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().
Trips %>%
ggplot(aes(x = sdate)) +
geom_density() +
labs(title = "Distribution of Bike Rentals Between October and December 2014",
x = "",
y = "Density")
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
Trips %>%
mutate(hour = hour(sdate),
minute = minute(sdate),
time = hour + (minute/60)) %>%
ggplot(aes(x = time)) +
geom_density() +
labs(title = "Distribution of Bike Rental Times",
x = "Time of Day (Military Time)",
y = "Density") +
theme_minimal()
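The hour-plus-fraction conversion from the hint can be checked on a couple of timestamps. This is a minimal sketch with invented date-times rather than values from Trips.

```r
library(lubridate)

# Two invented timestamps: 3:30 and 15:45
times <- ymd_hms(c("2014-10-01 03:30:00", "2014-10-01 15:45:00"))

# Hour of day plus the minute expressed as a fraction of an hour
decimal_hour <- hour(times) + minute(times) / 60

decimal_hour
```

Seconds are ignored here, which matches the hint. Adding second(times) / 3600 would be one way to include them.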
Trips %>%
mutate(weekday = wday(sdate, label = TRUE)) %>%
ggplot(aes(y = weekday)) +
geom_bar() +
labs(title = "Distribution of Bike Rentals by Day of the Week",
x = "Number of Rentals",
y = "")
Trips %>%
mutate(weekday = wday(sdate, label = TRUE),
hour = hour(sdate),
minute = minute(sdate),
time = hour + (minute/60)) %>%
ggplot(aes(x = time)) +
geom_density() +
facet_wrap(vars(weekday)) +
labs(title = "Distribution of Bike Rentals by Time of Day and Weekday",
x = "Time of Day (Military Time)",
y = "Density")
Observations: There is a distinct pattern present. On weekdays, as opposed to weekends, there are two distinct peaks in rental occurrences that coincide with the start and end of the workday. These peaks are absent on the weekends, where there is more of a standard, unimodal distribution that peaks a little after noon.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.
Trips %>%
mutate(weekday = wday(sdate, label = TRUE),
hour = hour(sdate),
minute = minute(sdate),
time = hour + (minute/60)) %>%
ggplot(aes(x = time)) +
geom_density(aes(fill = client), alpha = .5, color = NA) +
facet_wrap(vars(weekday)) +
labs(title = "Distribution of Bike Rentals by Time of Day and Weekday",
x = "Time of Day (Military Time)",
y = "Density",
color = "",
fill = "Client Type")
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
Trips %>%
mutate(weekday = wday(sdate, label = TRUE),
hour = hour(sdate),
minute = minute(sdate),
time = hour + (minute/60)) %>%
ggplot(aes(x = time)) +
geom_density(aes(fill = client), alpha = .5, color = NA, position = position_stack()) +
facet_wrap(vars(weekday)) +
labs(title = "Distribution of Bike Rentals by Time of Day and Weekday",
x = "Time of Day (Military Time)",
y = "Density",
fill = "Client Type")
Discussion: Personally, I find this format to be better at telling certain aspects of the story and worse at telling others. For instance, I believe that stacking these plots makes it more difficult to visually separate the trends for the casual and registered riders, even though these trends are still visible in the stacked plots. This might just be personal preference. The stacking does help, however, in gauging the relative proportions of client types among riders, and we can more directly compare the proportions of riders by time and day of week in the stacked version. So, the stacked version definitely has its strengths, but that’s not to say it’s without its weaknesses too. For instance, I find that it’s less intuitive to read the stacked version. Without knowing exactly what the stacking is representing and how to read it, it’s less accessible than the alternative. Ultimately, I think both are effective storytelling devices and it probably depends more on individual taste and what you’re used to seeing.
position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.
Trips %>%
mutate(weekday = wday(sdate, label = FALSE),
hour = hour(sdate),
minute = minute(sdate),
time = hour + (minute/60),
# with label = FALSE, wday() numbers Sunday as 1 and Saturday as 7 by default
weekend = ifelse(weekday %in% c(1, 7), "Weekend", "Weekday")) %>%
ggplot(aes(x = time)) +
geom_density(aes(fill = client), alpha = .5, color = NA) +
facet_wrap(vars(weekend)) +
labs(title = "Weekday vs Weekend: Distribution of Bike Rentals by Time of Day",
x = "Time of Day (Military Time)",
y = "Density",
fill = "Client Type")
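Because wday() returns numbers when label = FALSE, it is worth checking which numbers correspond to Saturday and Sunday before classifying weekends. The sketch below uses four invented dates (a Friday through a Monday in October 2014) and sets week_start = 7 explicitly, which is lubridate's default unless the lubridate.week.start option has been changed.

```r
library(lubridate)

# Friday, Saturday, Sunday, Monday (invented example dates)
dates <- ymd(c("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06"))

# With the week starting on Sunday: Sunday = 1, ..., Saturday = 7
day_num <- wday(dates, label = FALSE, week_start = 7)

# Saturday (7) and Sunday (1) are the weekend
weekend <- ifelse(day_num %in% c(1, 7), "Weekend", "Weekday")

# For comparison: a "> 5" test would flag Friday and miss Sunday
gt5 <- ifelse(day_num > 5, "Weekend", "Weekday")

data.frame(dates, day_num, weekend, gt5)
```

The comparison column shows why a simple numeric cutoff does not work under Sunday-first numbering.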
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?
Trips %>%
mutate(weekday = wday(sdate, label = TRUE),
hour = hour(sdate),
minute = minute(sdate),
time = hour + (minute/60)) %>%
ggplot(aes(x = time)) +
geom_density(aes(fill = weekday), alpha = .5, color = NA) +
facet_wrap(vars(client)) +
labs(title = "Distribution of Bike Rentals by Time of Day, Weekday, and Client Type",
x = "Time of Day (Military Time)",
y = "Density",
fill = "Day of Week")
Discussion: This graph supposedly helps provide more nuanced information about the specific days of the week, which was absent from the previous graph. Essentially, we can see if certain times of the day are more popular for certain days of the week and overlay them to create a tapestry of the full week. Personally, I don’t believe this graph provides much better information than the previous graph, and I believe this information would be better presented in a faceted form. Perhaps if the color scheme was better I would have more affinity towards it, but its current form is, in my opinion, somewhat needless.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
Trips %>%
group_by(sstation) %>%
summarize(n_departures = n()) %>%
arrange(desc(n_departures)) %>%
left_join(Stations,
by = c("sstation"="name")) %>%
ggplot(aes(y = lat, x = long, color = n_departures)) +
geom_point() +
scale_color_gradient(low="blue", high="red") +
labs(title = "Spatial Heatmap of Station Popularity",
x = "Longitude",
y = "Latitude",
subtitle = "Total Number of Departures at Each Station",
color = "Departures")
Trips %>%
mutate(casual = client == "Casual",
registered = client == "Registered") %>%
group_by(sstation) %>%
summarize(prop_casual = sum(casual)/(sum(registered) + sum(casual))) %>%
arrange(desc(prop_casual)) %>%
left_join(Stations,
by = c("sstation"="name")) %>%
ggplot(aes(y = lat, x = long, color = prop_casual)) +
geom_point() +
scale_color_gradient(low="blue", high="red") +
labs(title = "Spatial Heatmap of Most Popular Stations for Casual Clients",
x = "Longitude",
y = "Latitude",
color = "Proportion Casual",
subtitle = "Proportion of Total Rentals by Casual Clients at Each Station")
Observations: It seems that most stations are used mostly by registered clients, as most points are blue or purple. There are some clusters of stations with higher proportions of casual clients in the top left corner and in the center of the main cluster around 38.9 lat. It’s difficult to discern any patterns beyond this, however.
as_date(sdate) converts sdate from date-time format to date format.
TopTenSD <- Trips %>%
mutate(date = as_date(sdate)) %>%
group_by(date, sstation) %>%
count() %>%
arrange(desc(n)) %>%
head(10)
TopTenSD
New_Trip <- Trips %>%
mutate(sdate = as_date(sdate))
TopTenSD %>%
select(date, sstation) %>%
inner_join(New_Trip,
by = c("date" = "sdate", "sstation"))
TopTenSD %>%
select(date, sstation) %>%
inner_join(New_Trip,
by = c("date" = "sdate", "sstation")) %>%
mutate(weekday = wday(date, label = TRUE),
casual = client == "Casual",
registered = client == "Registered") %>%
group_by(weekday) %>%
summarize(percent_casual = sum(casual)/(sum(casual)+sum(registered)),
percent_registered = sum(registered)/(sum(casual)+sum(registered)))
Interpretation: From the table above, we can see that the vast, vast majority of bike rentals during the weekdays are by registered clients. In fact, registered clients make up over 90% of all rentals every day of the work week. This pattern nearly completely flips, however, on the weekends. On Sat. and Sun., over 80% of all rentals are by casual clients. This makes sense as people who are using the bikes for commuting to work would be more likely to become registered clients, whereas people who are using the bikes on the weekends are likely tourists or people who don’t require the bikes as frequently.
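The share of each client type per weekday can also be computed as the mean of a logical condition, since TRUE counts as 1 and FALSE as 0. The sketch below shows this on an invented table (rentals_toy is hypothetical and far smaller than the real joined data).

```r
library(dplyr)
library(tibble)

# Hypothetical rentals: 4 on a Monday, 4 on a Saturday
rentals_toy <- tribble(
  ~weekday, ~client,
  "Mon",    "Registered",
  "Mon",    "Registered",
  "Mon",    "Casual",
  "Mon",    "Registered",
  "Sat",    "Casual",
  "Sat",    "Casual",
  "Sat",    "Registered",
  "Sat",    "Casual"
)

# mean() of a logical vector is the proportion of TRUE values
props <- rentals_toy %>%
  group_by(weekday) %>%
  summarize(prop_casual     = mean(client == "Casual"),
            prop_registered = mean(client == "Registered"))

props
```

This gives the same proportions as dividing the count of one client type by the total count, provided client has no missing values.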
Link: https://github.com/greynolds1121/Reynolds_Week_3_HW/blob/main/Reynolds_HW3.md
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the Moodle page to see it.
kids %>%
filter(variable == "lib") %>%
group_by(state, year) %>%
summarize(variable,
spending = inf_adj_perchild) %>%
ggplot(aes(x = year, y= spending)) +
geom_smooth(method = "lm", se = FALSE) +
facet_geo(vars(state), scales="free")